
    Pipelined Model Parallelism: Complexity Results and Memory Considerations

    The training phase of Deep Neural Networks has become an important source of computing resource usage, and the resulting volume of computation makes it crucial to perform it efficiently on parallel architectures. Data parallelism is the most widely used method, but it requires replicating the network weights on all processors and performing collective communications of the network weights. In this context, model parallelism, in which the different layers of the network are distributed over the computing processors, is an attractive alternative. Indeed, it is expected to distribute the weights better (to cope with memory problems) and it eliminates the need for large collective communications, since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined approach, which in turn induces new memory costs. In this paper, our goal is to formalize pipelined model parallelism as a scheduling problem, to establish its complexity, and to analyze the consequences of the assumptions that are typically made in practical solutions such as Pipedream.
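
    As an illustration of the scheduling view taken above (not the paper's model, which also deals with memory constraints and complexity), the sketch below evaluates a given pipelined placement under a deliberately simplified cost model: no communication, perfect overlap, and a fixed partition of the layers into stages. The `pipelined_schedule_cost` helper and all timings are hypothetical.

```python
def pipelined_schedule_cost(stage_times, n_microbatches):
    """Steady-state behaviour of a pipelined model-parallel schedule under a very
    simplified model (no communication, perfect overlap): the throughput is set by
    the slowest stage."""
    period = max(stage_times)                 # time between two consecutive outputs
    fill = sum(stage_times)                   # time to fill the pipeline
    makespan = fill + (n_microbatches - 1) * period
    in_flight = len(stage_times)              # one microbatch per stage (forward-only view)
    return period, makespan, in_flight

# Hypothetical stage times (ms) for 4 stages and 32 microbatches.
print(pipelined_schedule_cost([12.0, 9.5, 14.0, 10.0], 32))
```

    Roughly speaking, the partitioning decisions studied in the paper determine these stage times, and the period of the slowest stage is what pipelining ultimately tries to keep small.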

    A Linear-Programming-Based Approach to Model Parallelism

    The training phase of Deep Neural Networks has become an important source of computing resource usage and, because of the resulting volume of computation, it is crucial to perform it efficiently on parallel architectures. Even today, data parallelism is the most widely used method, but the associated requirement to replicate all the weights on every computing resource raises memory problems at the level of each node and collective-communication problems at the level of the platform. In this context, model parallelism, which consists in distributing the different layers of the network over the computing nodes, is an attractive alternative. Indeed, it is expected to distribute the weights better (to cope with memory problems) and it does not imply large collective communications, since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined/streaming approach, which in turn leads to new memory costs. The goal of this paper is to model these memory costs in detail and to show that it is possible to formalize this optimization problem as an Integer Linear Program (ILP).
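
    To give a flavour of what an ILP for layer placement can look like, here is a toy formulation solved with the PuLP library: assign each layer to one processor, respect a per-processor memory budget for the weights, and minimize the heaviest compute load. This is only a hedged sketch with made-up values for `compute`, `weights`, `P` and `M`; the paper's ILP models pipelining, activations and communications in much more detail.

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, PULP_CBC_CMD

compute = [4, 6, 3, 5, 2, 7]      # hypothetical per-layer compute times
weights = [2, 8, 1, 6, 3, 4]      # hypothetical per-layer weight sizes (GB)
P, M = 3, 12                      # processors and per-processor memory budget (GB)

prob = LpProblem("layer_placement", LpMinimize)
x = {(l, p): LpVariable(f"x_{l}_{p}", cat=LpBinary)
     for l in range(len(compute)) for p in range(P)}
T = LpVariable("T", lowBound=0)   # makespan-like objective: heaviest processor load
prob += T
for l in range(len(compute)):
    prob += lpSum(x[l, p] for p in range(P)) == 1              # each layer on one processor
for p in range(P):
    prob += lpSum(weights[l] * x[l, p] for l in range(len(compute))) <= M   # memory budget
    prob += lpSum(compute[l] * x[l, p] for l in range(len(compute))) <= T   # load bound

prob.solve(PULP_CBC_CMD(msg=False))
print("heaviest load:", T.value())
print("placement:", {l: next(p for p in range(P) if x[l, p].value() > 0.5)
                     for l in range(len(compute))})
```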

    Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training

    Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent considering larger batch sizes and deeper networks, so that they can limit both convergence speed and accuracy. Recent works have proposed to offload some of the computed forward activations from the memory of the GPU to the memory of the CPU. This requires determining which activations should be offloaded and when these transfers from and to the memory of the GPU should take place. We prove that this problem is NP-hard in the strong sense, and we propose two heuristics based on relaxations of the problem. We perform an extensive experimental evaluation on standard Deep Neural Networks. We compare the performance of our heuristics against previous approaches from the literature, showing that they achieve much better performance in a wide variety of situations.
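
    The decision problem described above (which activations to offload, and when) is NP-hard; as a much weaker illustration, the sketch below applies a naive greedy rule, largest activations first, under hypothetical sizes and bandwidth. It is not one of the paper's heuristics and it ignores the scheduling of the transfers entirely.

```python
def greedy_offload(activations, mem_budget, bandwidth):
    """Pick which forward activations to offload to CPU, largest first, until the
    ones kept on the GPU fit in the budget. `activations` is a list of dicts with
    a 'size' field; sizes, budget and bandwidth share hypothetical units."""
    kept = sorted(activations, key=lambda a: a["size"], reverse=True)
    offloaded = []
    while kept and sum(a["size"] for a in kept) > mem_budget:
        offloaded.append(kept.pop(0))            # evict the largest remaining activation
    # each offloaded activation crosses the PCIe bus twice (offload, then reload)
    transfer_time = 2 * sum(a["size"] for a in offloaded) / bandwidth
    return offloaded, transfer_time
```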

    Efficient Combination of Rematerialization and Offloading for Training DNNs

    Rematerialization and offloading are two well-known strategies to save memory during the training phase of deep neural networks, allowing data scientists to consider larger models, batch sizes or higher-resolution data. Rematerialization trades memory for computation time, whereas offloading trades memory for data movements. As these two resources are independent, it is appealing to combine both strategies to save even more memory. We precisely model the costs and constraints corresponding to Deep Learning frameworks such as PyTorch or TensorFlow, we propose optimal algorithms to find a valid sequence of memory-constrained operations, and finally we evaluate the performance of the proposed algorithms on realistic networks and computation platforms. Our experiments show that the possibility to offload can remove one third of the overhead of rematerialization, and that together they can reduce the memory used for activations by a factor of 4 to 6, with an overhead below 20%.
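
    A minimal sketch of the bookkeeping behind such a combination, under assumed per-activation sizes, recomputation times and PCIe bandwidth: each activation is either kept on the GPU, offloaded to the CPU, or rematerialized, and the three choices consume different resources. This is accounting only, not the paper's optimal algorithms.

```python
def combined_plan_cost(acts, plan, bandwidth):
    """acts: list of dicts with 'size' (MB) and 'recompute' (ms); plan: maps each
    activation index to 'keep', 'offload' or 'rematerialize'; bandwidth in MB/ms.
    Returns (kept GPU memory, compute overhead, transfer overhead)."""
    kept = sum(a["size"] for i, a in enumerate(acts) if plan[i] == "keep")
    recompute = sum(a["recompute"] for i, a in enumerate(acts) if plan[i] == "rematerialize")
    transfers = 2 * sum(a["size"] for i, a in enumerate(acts) if plan[i] == "offload") / bandwidth
    return kept, recompute, transfers

# Hypothetical numbers: offloading pays in transfers, rematerialization pays in compute.
acts = [{"size": 800, "recompute": 3.0}, {"size": 200, "recompute": 1.5},
        {"size": 600, "recompute": 2.0}]
plan = {0: "offload", 1: "keep", 2: "rematerialize"}
print(combined_plan_cost(acts, plan, bandwidth=12.0))
```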

    MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism

    The training phase of Deep Neural Networks (DNNs) is very computationally intensive and is nowadays often performed on parallel computing platforms, ranging from a few GPUs to several thousand GPUs. The strategy of choice for the parallelization of training is the so-called data parallel approach, based on the parallel training of the different inputs (typically images) and the aggregation of network weights with collective communications (AllReduce). The scalability of this approach is limited both by the memory available on each node and by the networking capacities for collective operations. Recently, model parallel approaches, in which the network weights are distributed and images are processed in a pipelined/streamed manner over the computational nodes, have been proposed (PipeDream, GPipe). In this paper, we formalize in detail the optimization problem associated with the placement of DNN layers onto computation resources when using pipelined model parallelism, and we derive a dynamic programming based heuristic, MadPipe, which significantly improves the performance of the model parallel approach compared to the literature.
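
    For intuition about the dynamic programming flavour of such an approach (this is not MadPipe, which additionally models memory, communications and replication), the classic chain-partitioning DP below splits a chain of layers into P contiguous stages so as to minimize the heaviest stage. The layer `times` are hypothetical.

```python
from functools import lru_cache

def best_partition(times, P):
    """Split a chain of layer times into P contiguous stages so that the heaviest
    stage (i.e. the pipeline period) is as small as possible."""
    n = len(times)
    prefix = [0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(None)
    def dp(i, p):                        # best period for layers i..n-1 using p stages
        if p == 1:
            return prefix[n] - prefix[i]
        # j is the index where the first stage ends (layers i..j-1)
        return min(max(prefix[j] - prefix[i], dp(j, p - 1))
                   for j in range(i + 1, n - p + 2))

    return dp(0, P)

print(best_partition([4, 6, 3, 5, 2, 7, 1], 3))   # -> 10, e.g. stages [4,6] [3,5,2] [7,1]
```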

    A Makespan Lower Bound for the Scheduling of the Tiled Cholesky Factorization based on ALAP Schedule

    Due to the advent of multicore architectures and massive parallelism, the tiled Cholesky factorization algorithm has recently received plenty of attention and is often referenced by practitioners as a case study. It is also implemented in mainstream dense linear algebra libraries and is used as a testbed for runtime systems. However, a theoretical study of the parallelism of this algorithm is currently lacking. In this paper, we present new theoretical results about the tiled Cholesky factorization in the context of a parallel homogeneous model without communication costs. Based on the relative costs of the involved kernels, we prove that only two different situations must be considered, typically corresponding to CPUs and GPUs. Through a careful analysis of the number of tasks of each type that run simultaneously in the ALAP (As Late As Possible) schedule without resource limitation, we are able to determine precisely the number of busy processors at any time (as degree-2 polynomials). We then use this information to find a closed-form formula for the minimum time to schedule a tiled Cholesky factorization of size n on P processors. We show that this bound outperforms classical bounds from the literature. We also prove that ALAP(P), an ALAP-based schedule where the number of resources is limited to P, has a makespan extremely close to the lower bound, thus proving the effectiveness both of the ALAP(P) schedule and of the lower bound on the makespan.
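
    For context, the sketch below computes two classical lower bounds of the kind the abstract refers to: the total-work bound and the length of the dependency path that alternates POTRF, TRSM and SYRK along the diagonal. The kernel weights are placeholders (in practice they would be measured on the target CPU or GPU), and the paper's ALAP-based bound is tighter than both.

```python
def cholesky_lower_bound(N, P, w_potrf=1.0, w_trsm=3.0, w_syrk=3.0, w_gemm=6.0):
    """Two textbook makespan lower bounds for the tiled Cholesky factorization of an
    N x N tile matrix on P identical processors, without communication costs."""
    n_potrf = N
    n_trsm = N * (N - 1) // 2
    n_syrk = N * (N - 1) // 2
    n_gemm = N * (N - 1) * (N - 2) // 6
    total_work = (n_potrf * w_potrf + n_trsm * w_trsm
                  + n_syrk * w_syrk + n_gemm * w_gemm)
    work_bound = total_work / P
    # POTRF(k) -> TRSM(k+1,k) -> SYRK(k+1,k) -> POTRF(k+1) is a dependency path,
    # so its total weight bounds the makespan from below regardless of P.
    path_bound = N * w_potrf + (N - 1) * (w_trsm + w_syrk)
    return max(work_bound, path_bound)

print(cholesky_lower_bound(N=20, P=16))
```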

    Optimal Memory-aware Backpropagation of Deep Join Networks

    Deep Learning training memory needs can prevent the user from considering large models and large batch sizes. In this work, we propose to use techniques from memory-aware scheduling and Automatic Differentiation (AD) to execute a backpropagation graph with a bounded memory requirement, at the cost of extra recomputations. The case of a single homogeneous chain, i.e. the case of a network whose stages are all identical and form a chain, is well understood and optimal solutions have been proposed in the AD literature. The networks encountered in practice in the context of Deep Learning are much more diverse, both in terms of shape and heterogeneity. In this work, we define the class of backpropagation graphs and extend the class of graphs on which one can compute, in polynomial time, a solution that minimizes the total number of recomputations. In particular, we consider join graphs, which correspond to models such as Siamese or Cross Modal Networks.
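
    To make the memory/recomputation trade-off concrete, the sketch below implements the simplest baseline for a homogeneous chain: store every k-th activation with k ≈ √n and recompute inside each segment during the backward pass. The estimates assume unit-cost, unit-size layers and are only rough; the paper's contribution is the provably optimal schedules for more general graphs such as joins.

```python
import math

def uniform_checkpointing(n):
    """Store every k-th activation of an n-layer chain and recompute the rest during
    the backward pass; with k ~ sqrt(n) both stored checkpoints and the live segment
    stay around sqrt(n)."""
    k = max(1, round(math.sqrt(n)))
    checkpoints = list(range(0, n, k))                 # indices of stored activations
    # backpropagating a segment of length s needs roughly s - 1 extra forward steps
    extra_forwards = sum(max(0, min(k, n - c) - 1) for c in checkpoints)
    peak_memory_units = len(checkpoints) + k           # checkpoints + one live segment
    return checkpoints, extra_forwards, peak_memory_units

print(uniform_checkpointing(100))   # ~10 checkpoints, ~90 extra forwards, ~20 units
```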

    Weight Offloading Strategies for Training Large DNN Models

    The limited memory of GPUs induces serious problems in the training phase of deep neural networks (DNNs). Indeed, with the recent tremendous increase in the size of DNN models, which can now routinely include hundreds of billions or even trillions of parameters, it is impossible to store these models in the memory of a GPU, and several strategies have been devised to solve this problem. In this paper, we analyze in detail the strategy that consists in offloading the weights of some model layers from the GPU to the CPU when they are not used. Since the PCI bus bandwidth between the GPU and the CPU is limited, it is crucial to know which layers should be transferred (offloaded and prefetched) and when. We prove that this problem is in general NP-complete in the strong sense, and we propose a lower bound formulation in the form of an Integer Linear Program (ILP). We propose heuristics to select the layers to offload and to build the schedule of data transfers. We show that this approach makes it possible to build near-optimal weight offloading strategies on realistic-size DNNs and architectures.
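
    As a rough illustration of the constraint that makes this scheduling problem hard, the check below asks, layer by layer, whether a layer's weights can be prefetched over the PCIe bus while the preceding layers compute. The timings, sizes and the fixed `lookahead` window are hypothetical; the paper's ILP and heuristics decide offloading and prefetching jointly and far more carefully.

```python
def prefetch_feasible(compute_ms, weight_mb, pcie_mb_per_ms, lookahead=1):
    """For each layer, check whether its weights fit through the PCIe bus while the
    previous `lookahead` layers compute. A crude per-layer feasibility test only."""
    feasible = []
    for i in range(len(compute_ms)):
        if i < lookahead:                     # assume the very first layers start resident
            feasible.append(True)
            continue
        overlap_window = sum(compute_ms[i - lookahead:i])
        transfer_time = weight_mb[i] / pcie_mb_per_ms
        feasible.append(transfer_time <= overlap_window)
    return feasible

# Hypothetical per-layer compute times (ms), weight sizes (MB) and PCIe bandwidth (MB/ms).
print(prefetch_feasible([5, 8, 6, 7], [40, 120, 30, 90], pcie_mb_per_ms=12.0))
```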

    Optimal Checkpointing for Heterogeneous Chains: Training Deep Neural Networks with Limited Memory

    This paper introduces a new activation checkpointing method which significantly decreases memory usage when training Deep Neural Networks with the backpropagation algorithm. Similarly to checkpointing techniques from the Automatic Differentiation literature, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide an algorithm to compute the optimal computation sequence for this model. This paper also describes a PyTorch implementation that processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex, and automatically executes it according to the optimal checkpointing strategy computed for a given memory limit. Through extensive experiments, we show that our implementation consistently outperforms existing checkpointing approaches for a large class of networks, image sizes and batch sizes.
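
    For readers who want to see what activation checkpointing looks like in PyTorch, the snippet below uses the built-in `torch.utils.checkpoint.checkpoint_sequential`, which cuts a sequential model into equal segments and stores only the segment boundaries. This uniform strategy is the standard baseline, not the optimal memory-limit-driven strategy the paper describes; a recent PyTorch is assumed for the `use_reentrant=False` argument, and the model and sizes are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A plain sequential model standing in for an arbitrary chain of layers.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])
x = torch.randn(32, 1024, requires_grad=True)

# Uniform checkpointing: only the boundaries of the 4 segments are stored during the
# forward pass; the other activations are recomputed during the backward pass.
out = checkpoint_sequential(model, segments=4, input=x, use_reentrant=False)
out.sum().backward()
```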

    The ecological and coenotic features of plant communities containing Colchicum bulbocodium subsp. versicolor (Colchicaceae) in the Lower Volga region

    The article presents a phytocoenotic description of 23 plant communities with Colchicum bulbocodium subsp. versicolor studied during the period of mass flowering in 2014–2018. It was found that, across the Lower Volga region, the studied communities with C. bulbocodium subsp. versicolor are mostly confined to the slopes of south- and east-facing arroyos and more seldom to the southern and northern hill slopes, plains, arroyo and liman bases, and floodmeadows. During the period of mass flowering, 207 vascular plants were detected in the studied communities. Every community description included 9 to 36 species. Biological diversity was assessed with the Shannon index and polydominance index; the degree of dominance was measured with the Simpson index. The species similarity of the communities was evaluated through pairwise comparison with the Jaccard coefficient. It was revealed that C. bulbocodium subsp. versicolor occurs in communities varying in diversity and species composition. The subspecies is not confined to specific phytocoenoses. It usually grows on rich and, more seldom, fairly rich and slightly saline soils. Their alluviality is more often weak rather than moderate. Watering usually corresponds to the dry steppe or semi-desert climate type, rarely to the middle steppe type, being moderately variable and in some cases highly variable. The impact of grazing is usually weak, but it is either moderate or strong in some communities. The communities with C. bulbocodium subsp. versicolor are dominated by hemicryptophytes: mostly tap-root, short-rhizome and long-rhizome herbaceous perennials. In phytocoenotic terms, most species belong to the zonal type of vegetation, namely steppe vegetation. The participation of meadow species is prominent. The share of weed species is rather high, which indicates a significant anthropogenic load on the studied communities.
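
    The diversity statistics mentioned above have standard textbook definitions; the abstract does not specify the exact variants used (and the polydominance index is omitted here), so the sketch below only shows the usual forms of the Shannon index, the Simpson dominance index and the Jaccard similarity coefficient on hypothetical abundance data.

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum p_i * ln p_i over species abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c)

def simpson(counts):
    """Simpson dominance index D = sum p_i^2 (higher means stronger dominance)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def jaccard(a, b):
    """Jaccard similarity between two species lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical abundances for two community descriptions and their species lists.
print(shannon([12, 5, 3, 1]), simpson([12, 5, 3, 1]),
      jaccard(["Stipa", "Festuca", "Colchicum"], ["Festuca", "Colchicum", "Artemisia"]))
```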